Recommendations

### Part 1:

In this post I wanted to make a movie recommendation engine. I'll put it here at the top in case you just want to try it out. Type the exact title of a movie from the dataset into the textbox below:

*(interactive recommendation widget)*

Note that you must include the year as well. Without it, The Avengers was recommending The Russia House [1990] instead of Thor [2011] or something, since, apparently:

*(screenshots: two different films in the dataset titled “The Avengers”)*

Note that this fix isn't foolproof either because, in 2008 for example:

*(screenshots: several 2008 films whose titles differ only in the capitalization of “Journey”)*

I suspect that most movie recommendation engines learn from user interaction over time. Here I wanted to see how well I could do using only the data provided for the films themselves: the director list, the actor list, and so on.

For my recommendation engine, I did some brief calculations and found the objectively speaking 100% provably correct determination of similarity between films to be:

*(image: recommendation formula)*

The first thing I did was weight the director as the heaviest influence on films' similarity. This is obviously a personal preference: I consider directors to have more influence on films than actors, generally. There are some exceptions; for example, any movie that Samuel L. Jackson is in automatically becomes a Samuel L. Jackson film. Similarly for Jack Nicholson, or Nicolas Cage, or any number of other highly distinct actors.

I gave actors the next most weight, enough so that two actors working together count for more than a similar director. So I would say that Shyamalan’s Unbreakable [2000] is more similar to Tarantino’s Pulp Fiction [1994] than it is to Shyamalan’s The Village [2004] if we just consider these two actors and directors.

I originally gave writers a lot of weight, but then I started noticing some writers that just get tossed in on huge group projects. Each writer in those groups has less individual impact on the film, since there are several of them, but they were jacking up the similarities anyway. If I went back and did it again, I would probably change the number of actors/writers/directors in common in the equation to the proportion in common, to try to get around this problem.

Genre gets a small amount of weight: basically, if I ended up with three amazingly identical films in terms of directors and actors, I wanted the suggestion to at least be in the same genre.

Next I ran into a problem where a film with no real actors or directors in common with any other film would spit back 5 random films with similar genres. Those five were exactly as similar as hundreds of other films, so I had to take the formula further.

I put a hard weight on language, so if you input a Japanese movie, you should get a Japanese movie back. Moderately better than a random German film with nothing in common except the same genres (or maybe not…).

Finally I added two fuzzing factors: the release-date difference and the vote difference between the two films. These add some differentiation between any two movies, so the engine has at least some information to work with even if there are no actors, directors, writers, etc. in common. I used a square root for the date difference and a square for the vote difference, so it discriminates softly between years and sharply between ratings. Again, mathematically provably the best possible objective choice that in no way was just made up on the spot.

*(plots: date and vote fuzzing functions)*
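The pieces above could be combined along these lines. This is a sketch, not the exact formula: the weight values, field names, and the size of the language penalty are all made up for illustration; only the overall shape (weighted credit overlap, a hard language penalty, square-root date fuzzing, squared vote fuzzing) comes from the post.

```python
import math

# Hypothetical weights -- chosen only so that two shared actors (2 x 2.0)
# outweigh one shared director (3.0), as described in the post.
WEIGHTS = {"director": 3.0, "actor": 2.0, "writer": 1.0, "genre": 0.5}

def similarity(a, b):
    """Higher means more similar. `a` and `b` are dicts with set-valued
    credit fields plus `language`, `year`, and `vote` scalars."""
    score = 0.0
    for field, weight in WEIGHTS.items():
        score += weight * len(a[field] & b[field])  # credits in common
    if a["language"] != b["language"]:
        score -= 100.0  # hard penalty: keep recommendations in-language
    score -= math.sqrt(abs(a["year"] - b["year"]))  # soft year fuzzing
    score -= (a["vote"] - b["vote"]) ** 2           # sharp rating fuzzing
    return score
```

With these weights, the Unbreakable/Pulp Fiction example works out as claimed: two shared actors score 4.0, beating the 3.0 for a shared director.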

### Part 2:

I had another idea for attempting to find extremely disconnected movies which turned out to be surprisingly similar. Basically running a token comparison of the overviews of two different movies. So if the descriptions of two movies happen to include some unique words like, for example, “laser” and “planet,” we can find them even if they don’t share any actors, directors, etc.

*(posters: howard, plan9)*

To do this, I went through the descriptions of all of the films and computed what proportion of each description every word makes up. So if “laser” shows up twice in a 10-word description, its freq-film is 20% of that description. I also calculated how many times each word showed up across all descriptions, so if “laser” shows up 100 times in a set of 1 million description words, its freq-total is 0.01% of the total descriptions: a rare word.
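Computing both frequencies might look something like this. The function name is hypothetical, and it assumes the overviews have already been lowercased and stripped of punctuation, which the post does not specify:

```python
from collections import Counter

def token_frequencies(overviews):
    """Return (per-film word proportions, corpus-wide word proportions)."""
    corpus = Counter()
    per_film = []
    for text in overviews:
        words = text.split()
        counts = Counter(words)
        # freq-film: share of this film's description made up by each word
        per_film.append({w: c / len(words) for w, c in counts.items()})
        corpus.update(counts)
    total = sum(corpus.values())
    # freq-total: share of all description words made up by each word
    freq_total = {w: c / total for w, c in corpus.items()}
    return per_film, freq_total
```

For the example above, “laser” appearing twice in a 10-word description yields a freq-film of 0.2 for that film.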

Then we can calculate a token distance between two films as:

*(image: token distance formula)*
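The exact formula is in the image, so this is only a guess at its shape: score shared words by their prominence in each description divided by their corpus rarity, so that rare shared tokens (like “Deckard”, below) dominate. It is written here as a similarity, where higher means closer:

```python
def token_similarity(freq_a, freq_b, freq_total):
    """freq_a/freq_b: per-film word proportions; freq_total: corpus proportions.
    A guess at the post's token comparison -- rare shared words score highest."""
    score = 0.0
    for word in freq_a.keys() & freq_b.keys():  # words both overviews share
        score += (freq_a[word] * freq_b[word]) / freq_total[word]
    return score
```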

This was interesting, and you can play around with it below, but I didn't think it provided recommendations as good as the prior attempt. I suppose one could fold it in as an additional factor in the recommendation engine. Some matches are… surprising, to say the least:

*(posters: Blade Runner and Furious 7)*

This is probably a match because the token “Deckard” is wildly uncommon. I'm not sure of a fix for that, but it leads to some humorous matches.



I'd like to thank the themoviedb.org folks for access to their API, which was relatively painless to use. I am not affiliated with them in any way, and my opinions are my own. I'd also like to thank the developers and maintainers of Python, Bokeh, and matplotlib.

[the movie db](https://themoviedb.org) · [python](https://python.org) · [bokeh](https://bokeh.pydata.org) · [matplotlib](https://matplotlib.org)


*(interactive Bokeh plot)*